BeautifulSoup 2: Bloomberg

This notebook shows how to get an index name and value (also for multiple indices) from a Bloomberg quote page. Reminder: before anything else, check the site's robots.txt file to see whether scraping the pages you need is allowed (being nice), and only proceed if it is.
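
The check itself can be automated. Below is a minimal sketch using Python 3's standard urllib.robotparser module; the URLs are simply the ones used later in this notebook, and the result depends on whatever robots.txt says at the time you run it.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.bloomberg.com/robots.txt")
rp.read()  # download and parse robots.txt
print(rp.can_fetch("*", "https://www.bloomberg.com/quote/SPX:IND"))  # True if scraping this page is allowed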

The current value of the S&P 500 index

Let's find the current value of the S&P 500 index. Again, we need two libraries: requests for getting the HTML page and BeautifulSoup for parsing it.


In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = "http://www.bloomberg.com/quote/SPX:IND"

In [3]:
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, "html.parser")

If you use the Inspect element feature in Google Chrome, you will see that the name of the index sits inside an <h1> tag with a class="name" attribute. Let's first find all the <h1> tags on the page.


In [4]:
soup.findAll('h1')


Out[4]:
[<h1 class="bb-nav-logo" is="bb-nav-logo"> <div class="bb-nav-logo__rubix" data-tracker-label="bomb.open.rubix_button" data-tracker-action="click"></div> <div class="bb-nav-logo__menu" data-tracker-label="bomb.open.menu_button" data-tracker-action="click"> <div class="bb-nav-logo__menu-link">MENU</div> </div> <a href="http://www.bloomberg.com" class="bb-nav-logo__link" data-tracker-label="logo" data-tracker-action="click"></a> <div class="bb-nav-logo__arrow"></div> </h1>,
 <h1 class="bb-nav-content-logo__headline" data-tracker-label="content"> <a href="http://www.bloomberg.com" class="bb-nav-content-logo__site" data-tracker-label="logo" data-tracker-action="click"></a> </h1>,
 <h1 class="name"> S&amp;P 500 Index </h1>]

Obviously, this is more than what we want (the index name), so we should explicitly specify the attributes we are looking for:


In [5]:
index_name = soup.findAll('h1', attrs={'class': 'name'})
print(index_name)


[<h1 class="name"> S&amp;P 500 Index </h1>]

Now this works: we were able to find the correct tag. Let's get the text out of it.


In [6]:
print(index_name[0].text)


S&P 500 Index

The reason we used the index_name[0] notation is that index_name is a list (containing only one element, but still a list). The text attribute works on a single tag, so we had to pick the tag out of the list first and then take its text.
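
As an aside, find() (used later in this notebook) returns only the first matching tag rather than a list, so no indexing is needed. A quick sketch, with the None check added here as a precaution in case the tag is missing:

name_tag = soup.find('h1', attrs={'class': 'name'})  # first match, or None if nothing matches
if name_tag is not None:
    print(name_tag.text)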

Similarly, we can use the Inspect element feature to find out which tag the index value lives in, and then get its text.


In [7]:
index_value = soup.findAll('div', attrs={'class':'price'})
print(index_value[0].text)


2,429.39

Multiple indices (print)

If you are interested in getting data on multiple indices, just put all the URLs you want to scrape into a list and write a for loop that iterates over that list: for each URL it sends a request, gets the response, converts it to text (as we did above), passes it to the BeautifulSoup() function and extracts the index name and value as before.


In [8]:
urls = ["https://www.bloomberg.com/quote/DM1:IND",
        "https://www.bloomberg.com/quote/UKX:IND",
        "https://www.bloomberg.com/quote/EURUSD:CUR" ]

In [9]:
for url in urls:
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html.parser")
    index_name = soup.find("h1",attrs={'class':'name'})
    index_value = soup.find("div",attrs={'class':'price'})
    print(index_name.text+": "+index_value.text+"\n")


Dow Jones mini Futures: 21,207.00

FTSE 100 Index: 7,511.87

EURUSD Spot Exchange Rate: 1.1187
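
In practice, a scraping loop can fail partway through (an HTTP error, or a page whose layout has changed). Here is a hedged variant of the loop above with minimal error handling; raise_for_status() and the None checks are additions, not part of the original code.

for url in urls:
    response = requests.get(url)
    response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
    soup = BeautifulSoup(response.text, "html.parser")
    name_tag = soup.find("h1", attrs={"class": "name"})
    value_tag = soup.find("div", attrs={"class": "price"})
    if name_tag is None or value_tag is None:
        print("Could not parse " + url)  # the page layout may have changed
        continue
    print(name_tag.text.strip() + ": " + value_tag.text.strip())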

Multiple indices (save to dict)

If you are interested in saving the results, let's say into a dictionary, first create an empty dictionary and then add to it with the update() function:


In [10]:
my_data = {}

for url in urls:
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html.parser")
    index_name = soup.find("h1",attrs={'class':'name'})
    index_value = soup.find("div",attrs={'class':'price'})
    my_data.update({index_name.text:index_value.text})

Let's pretty print the content of the dictionary.


In [11]:
from pprint import pprint
pprint(my_data)


{u'Dow Jones mini Futures': u'21,207.00',
 u'EURUSD Spot Exchange Rate': u'1.1186',
 u'FTSE 100 Index': u'7,511.87'}

Being nice

Once you put the request-and-parse logic inside a for loop, you should be careful not to overwhelm the website's server. For that purpose you may want the loop to sleep a bit between iterations (say, 10 seconds). We can use the sleep() function from the time library for that.


In [12]:
# This is the same code as above with 2 additional lines
import time # importing the library
my_data = {}
for url in urls:
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, "html.parser")
    index_name = soup.find("h1",attrs={'class':'name'})
    index_value = soup.find("div",attrs={'class':'price'})
    my_data.update({index_name.text:index_value.text})
    time.sleep(10) # make the loop pause for 10 seconds between iterations

Some websites state in their documentation or robots.txt how long you should wait between requests; when nothing is specified, waiting around 30 seconds is a safe, conservative choice.
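
If the site publishes a Crawl-delay directive in its robots.txt, urllib.robotparser can read it (Python 3.6+). A hedged sketch; the 30-second fallback is an assumption of this sketch, not anything Bloomberg specifies.

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.bloomberg.com/robots.txt")
rp.read()
delay = rp.crawl_delay("*")  # None if no Crawl-delay is specified
time.sleep(delay if delay is not None else 30)  # assumed 30-second fallback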

Write to CSV (with datetime)

We may want to save the results into a CSV file. For that we can use the csv library. If you also want to save the date and time, the datetime library will be useful:


In [13]:
import csv
from datetime import datetime
with open("index_data.csv","w") as file: # create a new file for writing purposes
    writer = csv.writer(file)
    for i in my_data:
        writer.writerow([i,my_data[i],datetime.now()])

Construct a dataframe

You may want to construct a pandas dataframe from the dictionary we built; the DataFrame.from_dict() function from the pandas library does exactly that. The function takes two arguments: the dictionary to take the data from, and orient, which says whether the dictionary keys should become row labels ("index") or column labels ("columns"). Let's make them row labels.


In [14]:
import pandas as pd
data = pd.DataFrame.from_dict(my_data, orient="index")
print(data)


                                   0
FTSE 100 Index              7,511.87
EURUSD Spot Exchange Rate     1.1187
Dow Jones mini Futures     21,207.00

Transpose the dataframe

If you would rather have the dictionary keys as column names, you can transpose the dataframe as follows:


In [15]:
data.transpose()


Out[15]:
  FTSE 100 Index EURUSD Spot Exchange Rate Dow Jones mini Futures
0       7,511.87                    1.1187              21,207.00

And of course, since we are dealing with pandas, we can easily save the dataframe to a CSV file as follows.


In [16]:
data.transpose().to_csv("index_dataframe.csv")
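
For completeness, here is a quick sketch (an addition to the original notebook) of reading the saved file back into a dataframe with pandas:

check = pd.read_csv("index_dataframe.csv", index_col=0)  # first column holds the row labels
print(check)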